ggplot2 (usually just called ggplot) is a package in R that allows for highly customizable and pretty plots! Here, we're going to learn a few basics of making plots using ggplot that will hopefully get you well on your way to making informative and beautiful data visualizations!
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
We’re going to practice here on a dataset from the 1990 NHANES (National Health and Nutrition Examination Survey). The variables are below.
Region - Geographic region in the USA: Northeast (1), Midwest (2), South (3), and West (4)
Sex - Biological sex: Male (1), Female (2)
Age - Age measured in months (we’ll convert this to years below)
Urban - Residential population density: Metropolitan Area (1), Other (2)
Weight - Weight in pounds
Height - Height in inches
BMI - BMI, measured in kg/(m^2)
nhanes <- read.csv("NHANES1990.csv", stringsAsFactors = FALSE)
nhanes$Age <- nhanes$Age/12 # convert age to years for convenience
# Recoding factors
nhanes$Urban <- dplyr::recode(nhanes$Urban, '1' = 'Metro Area', '2' = 'Non-Metro Area')
nhanes$Region <- dplyr::recode(nhanes$Region, '1' = 'Northeast', '2' = 'Midwest', '3' = 'South', '4' = 'West')
head(nhanes)
## Region Sex Age Urban Weight Height BMI
## 1 South 2 42.75000 Non-Metro Area 171.7 65.3 28.4
## 2 West 1 25.58333 Non-Metro Area 155.2 62.3 28.2
## 3 West 2 73.83333 Metro Area 166.7 59.2 33.5
## 4 West 1 38.16667 Metro Area 224.7 71.9 30.6
## 5 Midwest 1 74.00000 Non-Metro Area 245.0 67.7 37.6
## 6 West 2 2.75000 Metro Area 28.3 35.2 16.0
-The ggplot() command creates a plot, which we can save to a variable with a <- ggplot()
-The general format is ggplot(data, aes(x = [x axis variable], y = [y axis variable])); the x and y variables are always specified in this aes() subfunction
Run this line of code
ggplot(nhanes, aes(x = Age, y = Weight))
We need to tell the ggplot() call what kind of graphic to put on the axes
We do this by adding a geom_[something] layer. So let's do a scatterplot with geom_point() first:
ggplot(nhanes, aes(x = Age, y = Weight)) + geom_point()
Wow, lots of data points! Maybe we can make the points smaller and more transparent to see better
ageWeightPlot <- ggplot(nhanes, aes(x = Age, y = Weight)) +
geom_point(size = .1, alpha = .3)
ageWeightPlot
Remember, it's a good habit to save your plots to objects, not just draw them!
ageWeightPlot <- ggplot(nhanes, aes(x = Age, y = Weight)) +
geom_point(size = 1, alpha = .2, aes(color = factor(Region), pch = factor(Region)))
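One payoff of saving plots to objects is that we can keep building on them. Here is a minimal sketch of that idea, using the built-in mtcars data (an assumption, chosen so the code runs without the NHANES file); the object names are made up for illustration.

```r
library(ggplot2)

# mtcars ships with R, so this runs without the NHANES file
basePlot <- ggplot(mtcars, aes(x = wt, y = mpg))  # axes only, no layers yet

# Adding a layer returns a new plot object; basePlot itself is unchanged
scatterPlot <- basePlot + geom_point(size = 2, alpha = .5)
labeledPlot <- scatterPlot + labs(x = 'Weight (1000 lbs)', y = 'Miles per gallon')

labeledPlot  # typing the object's name draws it
```

Because each step is its own object, we can go back and draw basePlot or scatterPlot at any time without retyping the earlier code.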
weightHistogram <- ggplot(nhanes) + geom_histogram(aes(x = Weight), bins = 100, fill = 'purple')
weightHistogram
Question: what if we want to look at the distribution of heights by neighborhood type in the nhanes data?
With ggplot, if x is a factor (discrete, not continuous) we can plot dots as a function of the factor
heightByUrban <- ggplot(nhanes, aes(x = Urban, y = Height)) +
geom_point()
heightByUrban
Whoa! So much data that it's hard to see anything. Let's use geom_jitter() with size = .05 and width = .1 instead
heightByUrban <- ggplot(nhanes, aes(x = Urban, y = Height)) + geom_jitter(size = .05, width = .1)
heightByUrban
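One thing worth knowing: if we only set width, geom_jitter() also nudges points up and down by a small default amount. The sketch below, on the built-in iris data (an assumption, so it runs on its own), uses position_jitter() to jitter sideways only and to make the random offsets repeatable.

```r
library(ggplot2)

# iris ships with R; Species is the discrete x variable here
jitterPlot <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_point(
    size = .5,
    # width spreads points sideways; height = 0 leaves the y values untouched;
    # seed makes the random offsets the same on every run
    position = position_jitter(width = .1, height = 0, seed = 42)
  )
jitterPlot

# layer_data() returns the values that actually get drawn for a layer
drawn <- layer_data(jitterPlot, 1)
head(drawn[, c('x', 'y')])
```

With height = 0 the plotted y values are exactly the measured values, which matters if readers will read numbers off the y axis.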
Still a TON of data! Here’s a great tool I really like for plotting the density of distributions of points like this!
# We can use 'source' to pull bits of code from github, as well as local files
source("https://gist.githubusercontent.com/benmarwick/2a1bb0133ff568cbe28d/raw/fb53bd97121f7f9ce947837ef1a4c65a73bffb3f/geom_flat_violin.R")
heightByUrban + geom_flat_violin(position = position_nudge(x = .15, y = 0), alpha = .7)
Now, we can really begin to see the skewness of these height distributions.
Plotting data points is all well and good, but what if we want to use our plots to summarize distributions? We’ll do that here:
weightByRegion <- ggplot(nhanes, aes(x = Region, y = Weight)) + stat_summary(fun = "mean", geom = "point") # fun replaces fun.y, which is deprecated as of ggplot2 3.3.0
weightByRegion
# People might want to know how to add a confidence interval
weightByRegionConf <- ggplot(nhanes, aes(x = Region, y = Weight)) + stat_summary(fun.data = "mean_cl_boot", fun.args=list(conf.int=.95))
weightByRegionConf
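It can help to see what a fun.data function actually hands back to stat_summary(): a small data frame with y, ymin, and ymax. This sketch uses mean_se(), which ships with ggplot2, on the built-in mtcars data (an assumption, to keep it self-contained). Note that mult = 1.96 gives only an approximate 95% interval under a normal approximation, which is a different technique from the bootstrap interval of mean_cl_boot.

```r
library(ggplot2)

# mean_se() returns the mean plus/minus `mult` standard errors (mult = 1 by default)
mpgSummary <- mean_se(mtcars$mpg)
mpgSummary  # a one-row data frame with columns y, ymin, ymax

# Assumption: a normal approximation, so 1.96 standard errors is roughly a 95% interval
mpgByCyl <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  stat_summary(fun.data = mean_se, fun.args = list(mult = 1.96))
mpgByCyl
```

Any function that returns those three columns can be passed as fun.data, so we could also write our own summary if neither of these fits.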
Let’s say we think there might be a linear relationship between height and weight
We can use geom_smooth() (or, equivalently, stat_smooth()) for this, with method = 'lm' specifically for a linear model. Also, level = .95 specifies the confidence level of the band drawn around the estimate at each x value
heightByWeight <- ggplot(nhanes, aes(x = Height, y = Weight)) + geom_point() + stat_smooth(method = 'lm')
heightByWeight
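Hmm, this actually looks like it's giving us some pretty bad predictions. We're not going to get into the stats of this now, but we can also plot using method = 'auto', which picks a smoother based on the size of the data (loess for fewer than 1,000 observations, a GAM otherwise) and can follow curved relationships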
heightByWeight <- ggplot(nhanes, aes(x = Height, y = Weight)) + geom_point() + stat_smooth(method = 'auto', level = .99)
heightByWeight
## `geom_smooth()` using method = 'gam'
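To see that method = 'lm' really is just an ordinary linear model, we can fit the model ourselves and compare it with the line ggplot draws. This is a sketch on the built-in mtcars data (an assumption, so it runs without the NHANES file); the object names are made up for illustration.

```r
library(ggplot2)

# Fit the linear model ourselves
fit <- lm(mpg ~ wt, data = mtcars)

smoothPlot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = 'lm', formula = y ~ x, level = .95)
smoothPlot

# The second layer holds the fitted line (y) and its confidence band (ymin, ymax)
band <- layer_data(smoothPlot, 2)

# Predict from our own model at the same x positions and compare
manual <- predict(fit, newdata = data.frame(wt = band$x))
all.equal(band$y, unname(manual))
```

Swapping method = 'lm' for method = 'loess' in the same code would draw a flexible local fit instead of a straight line.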
The labs() command can be added to ggplot with different arguments, like x, y, or title, to make the plots clearer
heightByUrban + labs(x = 'Neighborhood', y = 'Height in Inches', title = 'Height by Urban States')
Note, if we want to change labels for FACTORS, not just axes, it’s easier to do that using the tidyverse
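As a rough sketch of what that looks like, we can recode the factor in the data once, and every plot made afterwards picks up the readable labels. The toy data frame here is made up for illustration; the codes follow the Sex coding listed at the top (Male = 1, Female = 2).

```r
library(ggplot2)

# Hypothetical toy data with numeric codes, like the raw Sex column
toy <- data.frame(
  Sex = c(1, 2, 1, 2, 2, 1),
  Height = c(70.1, 64.2, 68.5, 62.9, 65.0, 71.3)
)

# Recode once in the data instead of relabeling in every plot
toy$Sex <- dplyr::recode(toy$Sex, `1` = 'Male', `2` = 'Female')

sexPlot <- ggplot(toy, aes(x = Sex, y = Height)) + geom_point()
sexPlot
```

The axis now shows Male and Female rather than 1 and 2, with no extra plotting code.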
It's useful to have several plots in a panel sometimes, not just one.
So for this data set, say we want to plot relationships between height and weight, but by region
We can do this with facet_wrap('Region')
facetPlot <- ggplot(nhanes, aes(x = Height, y = Weight)) +
geom_point(alpha = .2, size = .1) +
stat_smooth() +
facet_wrap('Region')
facetPlot
## `geom_smooth()` using method = 'gam'
We can even do multiple factors
multiFacet <- ggplot(nhanes, aes(x = Height, y = Weight)) +
geom_point(alpha = .2, size = .5, aes(color = Urban)) +
stat_smooth() +
facet_wrap(c('Region', 'Urban'), scales = 'free_y')
multiFacet
## `geom_smooth()` using method = 'gam'
We can color either continuously or discretely, depending on how a variable is represented in R
We put col = Height into the aes() function because the color depends on a variable in the data, rather than being one fixed color for every point
Continuous Example
weightByRegion <- ggplot(nhanes, aes(x = Region, y = Weight, col = Height)) +
geom_jitter(width = .1) +
labs(x = 'Region of US', y = 'Weight (lbs)', title = 'Weight by Region')
weightByRegion
Discrete Example
discretePlot <- ggplot(nhanes, aes(x = Height, y = Weight, col = Region)) + geom_point()
discretePlot
It is EASY to choose your own custom colors (as well as using R presets), but we’re not going to get into that right at this moment
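Just as a quick taste, here is a minimal sketch of one way to pick colors by hand, using scale_color_manual() on the built-in iris data (an assumption, so it runs on its own). The particular color names are arbitrary choices.

```r
library(ggplot2)

colorPlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, col = Species)) +
  geom_point() +
  # one color per factor level, matched by level name
  scale_color_manual(values = c(setosa = 'purple',
                                versicolor = 'darkorange',
                                virginica = 'steelblue'))
colorPlot
```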
We can use themes to make our plots prettier, and also customize the gridlines a lot
discretePlot + theme_bw()
discretePlot + theme_minimal()
discretePlot + theme_void()
discretePlot + theme_classic()
facetPlot + theme_bw() + labs(title = 'Weight by Height and Region')
## `geom_smooth()` using method = 'gam'
ggsave() saves a plot to a file. It takes arguments for filename, plot, dpi, width, and height
We can use a variety of file formats
ggsave('newPlotTest.pdf', plot = facetPlot, dpi = 300, width = 5, height = 5)
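To switch formats we only need to change the file extension, since ggsave() picks the graphics device from it. This sketch saves a PNG into a temporary folder (an assumption, to avoid cluttering the working directory) and uses the built-in mtcars data; the file and object names are made up.

```r
library(ggplot2)

savePlot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Same call as for a PDF, only the extension differs
pngFile <- file.path(tempdir(), 'savePlotTest.png')
ggsave(pngFile, plot = savePlot, dpi = 300, width = 5, height = 5)  # width and height are in inches by default

file.exists(pngFile)  # check that the file was written
```

For raster formats like PNG, dpi controls the pixel dimensions of the saved image, so a 5 x 5 inch plot at dpi = 300 is 1500 pixels on each side.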
Sometimes we might want to make a heatmap to look at the value of a variable on a 2D grid defined by two other variables.
This is kind of a silly example, but say we wanted to map out the number of observations in our dataset as a function of region and neighborhood type
# group the data
nhanesGroup <- nhanes %>%
dplyr::group_by(Region, Urban) %>%
dplyr::summarise(Observations = n())
ggplot(nhanesGroup, aes(x = Urban, y = Region)) +
geom_tile(aes(fill = Observations)) +
scale_fill_gradient(low = "blue", high = "red") +
theme_bw()
Working with a different dataset here
Let's make up some very simple data on the prices of two different items from 1978-2017
years <- 1978:2017
item1 <- rnorm(40,100,5)
item2 <- 1:40 + rnorm(40,100,5)
# Helpful to put it in long form using gather
timeFrame <- data.frame(years, item1, item2) %>%
tidyr::gather(., key = 'item', value = 'price', c(item1, item2)) %>%
mutate(se = runif(nrow(.), 1,5))
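If the wide-to-long step feels abstract, it helps to try it on a table small enough to read at a glance. The tiny data frame below is made up for illustration. The sketch also shows pivot_longer(), the newer tidyr function for the same reshape (this assumes tidyr 1.0.0 or later is installed).

```r
library(tidyr)

# Tiny made-up wide table: one column per item
wide <- data.frame(years = 2015:2017,
                   item1 = c(100, 102, 98),
                   item2 = c(101, 105, 110))

# gather() stacks item1 and item2 into one row per year-item pair
long <- gather(wide, key = 'item', value = 'price', c(item1, item2))
long

# pivot_longer() does the same reshape (rows come out in a different order)
long2 <- pivot_longer(wide, cols = c(item1, item2),
                      names_to = 'item', values_to = 'price')
long2
```

In long form there is a single price column and a single item column, which is exactly what lets us write color = item inside aes().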
We can plot the time series using geom_point() and connect the times using geom_line(), coloring by item
ggplot(timeFrame, aes(x = years, y = price, color = item)) +
geom_point() +
geom_line() +
theme_bw()
Now, let’s get some representation of our uncertainty into the plot! Notice that there is an included ‘se’ column for the standard error of each observation. Let’s plot error bars of +/- 1 standard error above and below each point.
We can do this with geom_errorbar(), using its ymin and ymax arguments
ggplot(timeFrame, aes(x = years, y = price, color = item)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = price - se, ymax = price + se), width = 0) +
theme_bw() +
labs(x = 'Year', y = 'Price', color = 'Item')
Alternatively, we can use shading to express our uncertainty more continuously. This time, let's shade the error within 2 standard errors of each measured point
ggplot(timeFrame, aes(x = years, y = price)) +
geom_point(aes(color = item)) +
geom_line(aes(color = item)) +
geom_ribbon(aes(ymin = price - 2*se, ymax = price + 2*se, fill = item),
alpha = .2, show.legend = F) +
theme_bw() +
labs(x = 'Year', y = 'Price', color = 'Item') +
scale_color_brewer(palette = 'Dark2') +
scale_fill_brewer(palette = 'Dark2')
Notice we've had to reformat a few calls here to adjust the aesthetic mapping. This happens sometimes when we want certain mappings to apply to ONLY certain parts of the plot. When we put the aes() call inside a geom_*() call, the mapping applies only to that geometrical object.
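Here is a small sketch of that scoping rule on the built-in mtcars data (an assumption, so it runs on its own). Color is mapped inside geom_point() only, so the line layer does not pick it up.

```r
library(ggplot2)

# x and y are set in ggplot(), so both layers share them;
# color is set inside geom_point(), so only the points use it
scopedPlot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_line()
scopedPlot

# layer_data() shows the colors each layer ends up drawing with
pointColors <- unique(layer_data(scopedPlot, 1)$colour)  # one color per cyl level
lineColors  <- unique(layer_data(scopedPlot, 2)$colour)  # a single default color
```

If we moved color = factor(cyl) into the ggplot() call instead, the line would be split and colored by cyl as well.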
Remember! It’s very important to always display the predictive uncertainty along with the estimates or mean predicted by your model. Otherwise, we don’t have any idea of how confident the model’s predictions are.
The ggplot basics will get you a LONG way.
Also, there is much more ggplot can do to make your plots very pretty, and to plot lots of complex models.
Unlike Excel and SPSS, which can often be cranky and difficult to bend to your will when customizing plots, ggplot is really easy to work with to make your graph look the way you want.